Node metrics #948
Signed-off-by: Cody Littley <[email protected]>
node/grpc/server_v2.go
Outdated
@@ -101,6 +110,8 @@ func (s *ServerV2) StoreChunks(ctx context.Context, in *pb.StoreChunksRequest) (
		return
	}

	s.metrics.ReportStoreChunksDataSize(size)
what if the store operation gets reverted in L125?
As a general rule of thumb, should we report incremental metrics if the operation as a whole fails? Or should we only report metrics for an operation if it is successful? (in another PR, you suggested that I should report latencies even when there are failures).
I can make this only report if the request ends up being valid, but I want to be consistent with the way we handle scenarios like this.
As discussed offline, this will be left the way it currently is.
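The trade-off raised in this thread can be sketched roughly as follows. This is only an illustration, not the code in this PR: `recorder`, `storeChunks`, and `errReverted` are hypothetical names. One counter is bumped as soon as the size is known (so it also counts requests that later fail), the other only after the whole operation succeeds.

```go
package main

import (
	"errors"
	"fmt"
)

// errReverted is a hypothetical error standing in for a store that gets rolled back.
var errReverted = errors.New("reverted")

// recorder is a hypothetical stand-in for a metrics object.
type recorder struct {
	bytesAttempted int // incremented as soon as the request size is known
	bytesStored    int // incremented only when the whole request succeeds
}

// storeChunks is a hypothetical handler; store simulates the step that may fail.
func storeChunks(r *recorder, size int, store func() error) error {
	// Incremental metric: reported even if the operation fails afterwards.
	r.bytesAttempted += size
	if err := store(); err != nil {
		return fmt.Errorf("store failed: %w", err)
	}
	// Success-only metric: reported once nothing can revert the write.
	r.bytesStored += size
	return nil
}

func main() {
	r := &recorder{}
	_ = storeChunks(r, 100, func() error { return nil })
	_ = storeChunks(r, 50, func() error { return errReverted })
	fmt.Println(r.bytesAttempted, r.bytesStored) // prints "150 100"
}
```

Whichever convention is chosen, the gap between the two counters is itself a signal: it shows how much data arrived in requests that did not complete.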
node/grpc/v2_metrics.go
Outdated
	for m.isAlive.Load() {
		var size int64
		err := filepath.Walk(m.dbDir, func(_ string, info os.FileInfo, err error) error {
is this thread safe?
It is almost certainly not thread safe (i.e. if levelDB deletes a file or a directory mid-walk, then filepath.Walk() will return an error). My hope was that if the race condition was sufficiently rare, we could still extract meaningful metrics data.
Currently, this will log an error whenever this method is unable to fetch new data. Will an error log cause problems if it triggers every once in a while? If so, should this be downgraded to a logger.info() call?
Unfortunately, levelDB doesn't expose an API that tells you the size of the DB (that I know of). My reasoning was that this metric would be sufficiently valuable to justify a hacky collection method.
In theory, we could have the levelDB wrapper track the quantity of data, at the cost of some extra bookkeeping (every DB modification would need to update a special size key-value pair). This wouldn't tell us the size of the files on disk (which may vary depending on things like compaction and indexes), but would give us a very good idea of the approximate size of the DB. If I implemented such a thing, it would need to be in a standalone PR.
The final option would be to just delete this metric entirely. I'll defer to your judgement on this.
This metric is now removed.
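For reference, the race discussed in this thread could be tolerated rather than avoided. The sketch below is an assumption-laden illustration and not what the PR contained: `dirSize` and `demo` are hypothetical names, and it uses `filepath.WalkDir` instead of `filepath.Walk`. Entries that disappear mid-walk are skipped instead of aborting the measurement, so the result is an approximation.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// dirSize sums the sizes of non-directory entries under root. Entries that
// vanish while the walk is in progress are skipped rather than treated as
// fatal, so the result is approximate if the directory is being modified.
func dirSize(root string) (int64, error) {
	var size int64
	err := filepath.WalkDir(root, func(_ string, d fs.DirEntry, err error) error {
		if err != nil {
			if errors.Is(err, fs.ErrNotExist) {
				return nil // removed concurrently; skip it
			}
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			if errors.Is(err, fs.ErrNotExist) {
				return nil // removed between the directory read and the stat
			}
			return err
		}
		size += info.Size()
		return nil
	})
	return size, err
}

// demo builds a throwaway directory with two files (5 + 3 bytes) and measures it.
func demo() (int64, error) {
	dir, err := os.MkdirTemp("", "dbsize")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "a.ldb"), []byte("hello"), 0o600); err != nil {
		return 0, err
	}
	if err := os.WriteFile(filepath.Join(dir, "b.log"), []byte("abc"), 0o600); err != nil {
		return 0, err
	}
	return dirSize(dir)
}

func main() {
	n, err := demo()
	fmt.Println(n, err)
}
```

Skipping missing entries keeps the walk from failing, but the reported number can still be stale or slightly off during compaction, which is the same caveat raised in the comments above.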
node/grpc/v2_metrics.go
Outdated
func (m *V2Metrics) ReportStoreChunksLatency(latency time.Duration) {
	m.storeChunksLatency.WithLabelValues().Observe(
		float64(latency.Nanoseconds()) / float64(time.Millisecond))
float64(latency.Milliseconds()) should work
I'm intentionally not using that (@ian-shim made the same suggestion on another PR 😜). latency.Milliseconds() returns an integer (int64), meaning we lose all sub-millisecond fidelity in the measurement. For many metrics this isn't a huge deal (e.g. if you are measuring something that takes hundreds of milliseconds), but for some of the things we are measuring the precision is nice to have.
Let me know if you'd like to discuss this further.
I think it depends on the magnitude of the values we expect to deal with here. If it's ultra-low latency and sub-ms matters, Microseconds() may be an option.
For the sake of consistency, I've been converting everything to ms for reporting latencies.
Although I can often make an educated guess as to the expected latency for an operation, absent experimental data it's only a guess. If I guess wrong, we could end up reporting in a unit that is too coarse for what we are measuring.
Would this be a topic you think worthwhile to schedule a short call to discuss?
As discussed, all of these now use a utility method in common.
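A minimal sketch of what such a conversion helper might look like is shown here. The name `ToMilliseconds` is hypothetical and the actual utility in common may differ. It contrasts the truncating `Milliseconds()` method with a floating-point conversion that keeps sub-millisecond precision.

```go
package main

import (
	"fmt"
	"time"
)

// ToMilliseconds converts a duration to fractional milliseconds, preserving
// sub-millisecond precision. The name is hypothetical, chosen for illustration.
func ToMilliseconds(d time.Duration) float64 {
	return float64(d.Nanoseconds()) / float64(time.Millisecond)
}

func main() {
	d := 1500 * time.Microsecond

	// Duration.Milliseconds returns an int64, so the fraction is truncated.
	fmt.Println(d.Milliseconds()) // prints 1

	// Dividing nanoseconds as floats keeps the fractional part.
	fmt.Println(ToMilliseconds(d)) // prints 1.5
}
```

Keeping every latency in one unit (ms) with fractional values avoids having to guess up front whether an operation is fast enough to need microseconds.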
Why are these changes needed?
Adds metrics to the v2 DA node.
Checks